1 Goal of the EDA

The goal of this EDA is to understand different parameters which impacts an individual’s choice of Airpot or Airline.

2 Prerequisites for analysis

Following needs to be prior to any analysis:

  • Clear your environment
  • Load required packages
  • Organize work and point to the right directories

Load two dataset provided with this project.

  • Survey data is the record of every individual
  • Airport Airline data is raw data for trating missing values in survey. There are 3 sheets in this file
dat1 <- read_excel("../raw_data/airport_choice_survey_EN_ver2.0_Capstone.xlsx", sheet = 1)
dat2 <- read_excel("../raw_data/Airport_Airline_data.xlsx", sheet = 1)
airline_airfare <- read_excel("../raw_data/Airport_Airline_data.xlsx", sheet = 2)
provincial_distances <- read_excel("../raw_data/Airport_Airline_data.xlsx", sheet = 3)

3 Descriptive Statistics

Summarization of the given data

# Examine data structure 
str(dat1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    488 obs. of  28 variables:
##  $ ID                       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Airport                  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airline                  : num  1 1 1 1 1 1 1 1 3 1 ...
##  $ Age                      : num  49 49 25 29 34 43 43 52 24 52 ...
##  $ Gender                   : num  1 2 1 1 2 1 2 1 2 2 ...
##  $ Nationality              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ TripPurpose              : num  2 1 1 2 2 4 4 1 1 1 ...
##  $ TripDuration             : num  7 4 10 7 4 40 40 7 4 5 ...
##  $ FlyingCompanion          : num  0 4 2 2 0 1 1 1 1 1 ...
##  $ ProvinceResidence        : num  3 3 3 3 3 3 3 3 3 5 ...
##  $ GroupTravel              : num  2 2 2 2 2 2 2 2 2 1 ...
##  $ NoTripsLastYear          : num  3 1 1 2 2 4 0 0 0 2 ...
##  $ FrequentFlightDestination: chr  "1" "1" "2" "5" ...
##  $ Destination              : num  3 2 1 2 1 1 1 2 3 3 ...
##  $ FlightNo                 : chr  "KE627" "KE703" "KE897" "KE725" ...
##  $ DepartureHr              : chr  "12" "10" "11" "15" ...
##  $ DepartureMn              : num  5 10 30 20 0 45 45 15 35 0 ...
##  $ DepartureTime            : num  2 1 1 2 2 2 2 1 3 3 ...
##  $ SeatClass                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airfare                  : num  80 41 NA 40 40 34 34 56 34 NA ...
##  $ NoTransport              : num  1 1 2 1 1 1 1 1 2 2 ...
##  $ ModeTransport            : num  6 6 4 1 6 6 6 6 2 10 ...
##  $ AccessCost               : num  8000 8000 1000 NA 8000 8000 8000 8000 2400 90000 ...
##  $ AccessTime               : num  40 50 20 NA 50 50 50 50 5 60 ...
##  $ Occupation               : num  1 9 12 8 1 1 9 2 12 12 ...
##  $ Income                   : num  5 3 NA 7 3 3 3 3 NA NA ...
##  $ MileageAirline           : chr  "1" NA NA "2" ...
##  $ Mileage                  : num  150000 NA NA 100000 NA NA NA NA NA NA ...
summary(dat1)
##        ID           Airport        Airline           Age       
##  Min.   :  1.0   Min.   :1.00   Min.   :1.000   Min.   :17.00  
##  1st Qu.:122.8   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:29.00  
##  Median :244.5   Median :2.00   Median :2.000   Median :38.00  
##  Mean   :244.5   Mean   :1.51   Mean   :2.423   Mean   :39.97  
##  3rd Qu.:366.2   3rd Qu.:2.00   3rd Qu.:4.000   3rd Qu.:50.00  
##  Max.   :488.0   Max.   :2.00   Max.   :4.000   Max.   :80.00  
##                                 NA's   :10      NA's   :1      
##      Gender       Nationality     TripPurpose     TripDuration   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :  0.00  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:  4.00  
##  Median :2.000   Median :1.000   Median :1.000   Median :  5.00  
##  Mean   :1.528   Mean   :1.484   Mean   :1.518   Mean   : 27.44  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:  8.00  
##  Max.   :2.000   Max.   :5.000   Max.   :4.000   Max.   :730.00  
##  NA's   :3                                                       
##  FlyingCompanion ProvinceResidence  GroupTravel    NoTripsLastYear  
##  Min.   : 0.00   Min.   :1.000     Min.   :1.000   Min.   :  0.000  
##  1st Qu.: 1.00   1st Qu.:1.000     1st Qu.:2.000   1st Qu.:  1.000  
##  Median : 2.00   Median :3.000     Median :2.000   Median :  2.000  
##  Mean   : 2.82   Mean   :3.391     Mean   :1.826   Mean   :  3.262  
##  3rd Qu.: 3.00   3rd Qu.:5.000     3rd Qu.:2.000   3rd Qu.:  3.000  
##  Max.   :34.00   Max.   :8.000     Max.   :2.000   Max.   :122.000  
##                                                                     
##  FrequentFlightDestination  Destination      FlightNo        
##  Length:488                Min.   :1.000   Length:488        
##  Class :character          1st Qu.:1.000   Class :character  
##  Mode  :character          Median :2.000   Mode  :character  
##                            Mean   :2.176                     
##                            3rd Qu.:3.000                     
##                            Max.   :4.000                     
##                            NA's   :5                         
##  DepartureHr         DepartureMn    DepartureTime     SeatClass    
##  Length:488         Min.   : 0.00   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.:15.00   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Median :30.00   Median :2.000   Median :1.000  
##                     Mean   :25.98   Mean   :2.432   Mean   :1.122  
##                     3rd Qu.:40.00   3rd Qu.:3.000   3rd Qu.:1.000  
##                     Max.   :55.00   Max.   :4.000   Max.   :3.000  
##                     NA's   :120                     NA's   :4      
##     Airfare        NoTransport    ModeTransport      AccessCost    
##  Min.   :  3.00   Min.   :1.000   Min.   : 1.000   Min.   :     0  
##  1st Qu.: 35.00   1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:  2000  
##  Median : 45.00   Median :1.000   Median : 3.000   Median :  6000  
##  Mean   : 50.46   Mean   :1.334   Mean   : 3.801   Mean   : 11220  
##  3rd Qu.: 60.00   3rd Qu.:2.000   3rd Qu.: 6.000   3rd Qu.: 12000  
##  Max.   :260.00   Max.   :4.000   Max.   :11.000   Max.   :350000  
##  NA's   :155                                       NA's   :197     
##    AccessTime       Occupation         Income      MileageAirline    
##  Min.   :  4.00   Min.   : 1.000   Min.   :1.000   Length:488        
##  1st Qu.: 25.00   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Median : 40.00   Median : 8.000   Median :3.000   Mode  :character  
##  Mean   : 51.83   Mean   : 6.768   Mean   :3.126                     
##  3rd Qu.: 60.00   3rd Qu.:12.000   3rd Qu.:4.000                     
##  Max.   :390.00   Max.   :12.000   Max.   :7.000                     
##  NA's   :97                        NA's   :132                       
##     Mileage      
##  Min.   :     1  
##  1st Qu.:  8258  
##  Median : 27998  
##  Mean   : 56384  
##  3rd Qu.: 61808  
##  Max.   :500000  
##  NA's   :398

We have so many missing values in survey data:

  • Neglecting parameters where less than 10 records are missing, DepartureMinute - flight departure time has 120 records missing
  • Airfare - Price of the ticket has 155 records missing
  • Access Cost - Cost from residence to Airport has 197 records missing
  • Access Time - Commute time from residence to Airport has 97 records missing
  • Income - Earing of the survey participant has 132 records missing
  • Mileage - Air miles from airlines for frequent travel customer has 398 records missing

4 Data manipulation

Changing the survey data to make it easier to read and use during EDA process

4.1 Categorical Variables - Mutation

4.1.1 Airport

Modifying the Airport Name for purpose of analysis

  • There are no records in survey data where Airport is missing
# 1. Inchoen (ICN) Airport
# 2. Gimpo (GMP) Airport
# No missing values
dat1 <- dat1 %>%
  mutate(Airport = case_when(Airport == 1 ~ "ICN",
                            Airport == 2 ~ "GMP"))

4.1.2 Airline

Modifying Airline Choice Names

  • There are 10 records in survey data where Airline is missing
  • We will try to get this data using Flight number if provided for these missing records, else remove them for modelling
# 1. Korean Air(KE)
# 2. Asiana Airlines (OZ)
# 3. Korean LCC
# 4. Foreign Airlines

# Replace missing values
dat1 <- dat1 %>%
  mutate(Airline = ifelse(!is.na(Airline), Airline, ifelse(str_detect(FlightNo, "KE"), "Korean Air", "Unknown"))) %>%
  filter(Airline != "Unknown")

dat1 <- dat1 %>%
 mutate(Airline = case_when(Airline == 1 ~ "Korean Air",
                           Airline == 2 ~ "Asiana Airlines",
                           Airline == 3 ~ "Korean LCC",
                           Airline == 4 ~ "Foreign Airlines"))

4.1.3 Gender

Modifying Gender

  • There are 3 records in survey data where Gender is missing
  • We will consider them and Not mentioned instead of filling with mode value
dat1 <- dat1 %>%
  mutate(Gender = if_else(Gender == 1, "Male","Female","Not Mentioned"))

4.1.4 Nationality

Modifying Nationality

  • There are no records in survey data where Nationality is missing
# 1. Korea
# 2. China
# 3. Japan
# 4. Southeast Asia
# 5. Other
dat1 <- dat1 %>%
  mutate(Nationality = case_when(Nationality == 1 ~ "Korea",
                            Nationality == 2 ~ "China",
                            Nationality == 3 ~ "Japan",
                            Nationality == 4 ~ "Southeast Asia",
                            Nationality == 5 ~ "Other"))

4.1.5 Trip Purpose

Modifying Purpose of Trip

  • There are no records in survey data where Trip Purpose is missing
# 1. Leisure Travel (sightseeing, visiting family, vacation, shopping)
# 2. Business Travel 
# 3. Study
# 4. Other
dat1 <- dat1 %>%
  mutate(TripPurpose = case_when(TripPurpose == 1 ~ "Leisure",
                            TripPurpose == 2 ~ "Business",
                            TripPurpose == 3 ~ "Study",
                            TripPurpose == 4 ~ "Other"))

4.1.6 Province Residence

Modifying Province Residence

  • There are no records in survey data where Province Residence is missing but access time and distance from province is missing for some records

4.1.6.1 Data manipulation in Province raw data file to fill missing values in survey

provincial_distances <- provincial_distances %>% 
  separate(Province, c("ProvinceResidence", "Province"), sep = "\\.", remove=FALSE) 

4.1.6.2 Add province distance and time to the airport_survey_data : Join both raw data

# Join the two datasets on airport name and province ids

dat1 <- dat1 %>%
  mutate(ProvinceResidence = as.character(ProvinceResidence)) %>%
  left_join(provincial_distances, by = c("Airport", "ProvinceResidence")) %>%
  select(-c(ProvinceResidence)) %>%
  mutate(Province = as.factor(Province),
          Airport = as.factor(Airport))

4.1.7 Group Travel

Modifying Group Travel

  • There are no records in survey data where Group Travel is missing
# 1. Yes
# 2. No
dat1 <- dat1 %>%
  mutate(GroupTravel = case_when(GroupTravel == 1 ~ 1, GroupTravel == 2 ~ 0))

4.1.8 FrequentFlightDestination

Modifying Frequent Travel Destinations

  • There are no records in survey data where Frequent Travel Destinations is missing
# 1. Southeast Asia
# 2. China
# 3. Japan
# 4. North/South America
# 5. Europe
# 6. Other
# 7. None
dat1 <- dat1 %>%
  mutate(FrequentFlightDestination = case_when(FrequentFlightDestination == 1 ~ "Southeast Asia",
                            FrequentFlightDestination == 2 ~ "China",
                            FrequentFlightDestination == 3 ~ "Japan",
                            FrequentFlightDestination == 4 ~ "North/South America",
                            FrequentFlightDestination == 5 ~ "Europe",
                            FrequentFlightDestination == 6 ~ "Other",
                            FrequentFlightDestination == 7 ~ "None"))

4.1.9 Destination

Modifying Flight Destinations

  • There are 5 records in survey data where Flight Destinations of individual is missing
  • can we get information from flight No?
# 1. China
# 2. Japan
# 3. Southeast Asia
# 4. Other
dat1 <- dat1 %>%
  mutate(Destination = case_when(Destination == 1 ~ "China",
                            Destination == 2 ~ "Japan",
                            Destination == 3 ~ "Southeast Asia",
                            Destination == 4 ~ "Other"))

4.1.10 DepartureTime

Modifying Departure Time

  • There are no records in survey data where Departure Times is missing
# 1. 6am - 12pm
# 2. 12pm - 6pm
# 3. 6pm - 9pm
# 4. 9pm - 6am
dat1 <- dat1 %>%
  mutate(DepartureTime = case_when(DepartureTime == 1 ~ "6am - 12pm",
                            DepartureTime == 2 ~ "12pm - 6pm",
                            DepartureTime == 3 ~ "6pm - 9pm",
                            DepartureTime == 4 ~ "9pm - 6am"))

4.1.11 SeatClass

Modifying Seat Class

  • There are 4 records in survey data where Seat Class is missing
  • If we will use this parameter in some modelling, we can fill missings with mode.
# 1. Economy
# 2. Business
# 3. First Class
dat1 <- dat1 %>%
  mutate(SeatClass = case_when(SeatClass == 1 ~ "Economy",
                            SeatClass == 2 ~ "Business",
                            SeatClass == 3 ~ "First Class"))

4.1.12 ModeTransport

Modifying Mode of Transport

  • There are no records in survey data where Mode of Transport is missing
  • We are removing Air transport mode - that is outlier for this parameter
# 1. Car
# 2. Taxi
# 3. Bus
# 4. Express Bus
# 5. Subway
# 6. Airport Limousine
# 7. Rail
# 8. KTX (Express Rail)
# 9. Shuttle provided by travel agency
# 10. Air
# 11. Other
dat1 <- dat1 %>%
  mutate(ModeTransport = case_when(ModeTransport == 1 ~ "Private Car",
                                   ModeTransport == 2 ~ "Private Taxi",
                                   ModeTransport == 3 ~ "Public Bus",
                                   ModeTransport == 4 ~ "Public Bus",
                                   ModeTransport == 5 ~ "Subway",
                                   ModeTransport == 6 ~ "Private Taxi",
                                   ModeTransport == 7 ~ "Public Train",
                                   ModeTransport == 8 ~ "Public Train",
                                   ModeTransport == 9 ~ "Private Taxi",
                                   ModeTransport == 11 ~ "Other")) %>%
  filter(ModeTransport != 10)
# Regroup into maximum 3-4 buckets

4.1.13 Age

Modifying Age and add them into groups

#Youth (15-24 years) 
#Adults (25-64 years) 
#Seniors (65 years and over)
# dat1 <- dat1 %>%
#    mutate(AgeGroup = case_when(Age >=15 & Age <= 24 ~ "15-24",
#                                Age >=25 & Age <= 34 ~ "25-34",
#                                Age >=35 & Age <= 44 ~ "35-44",
#                                Age >=45 & Age <= 54 ~ "45-54",
#                                Age >=55 & Age <= 64 ~ "55-64",
#                                TRUE ~ "Senior"))

4.1.14 Occupation

Modifying Occupation

# 1. Entrepreneur, Senior management
# 2. Business (Corporate worker)
# 3. Sales, Service
# 4. Government, Military
# 5. Professionals (doctor, lawyer, professor)
# 6. industrial, manufacturing
# 7. self-employeed
# 8. Student
# 9. Housewife
# 10. Retired
# 11. None
# 12. Other

4.1.15 Income

Modifying Income Class

# 1. 30 Million Won or less
# 2. 30~50 Million Won
# 3. 50~80 Million Won
# 4. 80 ~100 Million Won
# 5. 100 ~150 Million Won
# 6. 150 ~200 Million Won
# 7. 200 Million Won or more

4.1.16 MileageAirline

Modifying Mileage Airline

  • There are no records in survey data where Mileage Airline is missing
# 1. Korea Airlines
# 2. Asian Airlines
# 3. Others
# 4. None
# 5. Unknown Missing Value
dat1 <- dat1 %>%
  mutate(MileageAirline = case_when(MileageAirline == 1 ~ "Korea Airlines",
                                   MileageAirline == 2 ~ "Asian Airlines",
                                   MileageAirline == 3 ~ "Others",
                                   MileageAirline == 4 ~ "None",
                                   MileageAirline == 5 ~ "Unknown Missing Value"))
#dat1 <- dat1 %>%
#  mutate(MileageAirline = ifelse(is.na(Mileage), 'No Mileage', 'Mileage'))         

4.2 Quantitative Variables - Handle Missing values

4.2.1 Age

Replacing missing values in age

  • There is 1 record where Age is missing in survey data provided
  • We have to check the distribution of Age before filling the missing value - So histogram
# Check distribution of age
hist(dat1$Age)

  • The distribution appears to be normal and not highly skewed/ outliers.
  • Hence, We use mean of the age of survey participants to replace missing entries
age_mean = mean(dat1$Age, na.rm = TRUE)
dat1$Age[is.na(dat1$Age)] <- age_mean
  • After this operation we are left with no records where this Age is missing

4.2.2 Trip Duration

4.2.3 Flying Companion

Not relevant

4.2.4 NoTripsLastYear

4.2.5 Airfare

Replacing missing values in Airfare

  • There are 155 records where Air fare is missing in survey data provided
  • We have to check the distribution of Airfare before filling the missing value - So histogram
  • We also have Airport_Airline_data -> raw data which can be used to treat missing values
# Check distribution of Airfare
hist(dat1$Airfare)

# change column name for join
names(airline_airfare)[names(airline_airfare) == 'Flight.Destination'] <- 'Destination'

dat1 = dat1 %>%
  left_join(airline_airfare, by = c("Airport", "Airline", "Destination")) %>%
  mutate(Airfare = ifelse(is.na(dat1$Airfare), `Average Price`/10000, Airfare))

# fill remaining from survey data average of group of -- Airport, Airlines, Destination, seat type
airfare_average = dat1 %>%
  group_by(Airport, Airline, Destination, SeatClass) %>%
  summarise(a_fare=mean(Airfare, na.rm = TRUE))

dat1 = dat1 %>%
  left_join(airfare_average, by = c("Airport", "Airline", "Destination", "SeatClass")) %>%
  mutate(Airfare = ifelse(is.na(dat1$Airfare), a_fare, Airfare))
  • Airfare in raw data file is multiplied by 10000, so to use that value we have to divide it by 10000
  • After this operation we are left with 6 records where this airfare is missing

4.2.6 NoTransport

4.2.7 AccessCost

Replacing missing values in Access Cost

  • There are 197 records where Access Cost is missing in survey data provided
  • We have to check the distribution of Access Cost before filling the missing value - So histogram
# Check distribution of AccessCost
hist(dat1$AccessCost)

  • We dont have any useful information for this parameter in Airport_Airline_data -> raw data so these can not be treated
  • Also, 197 is a almost 40% of total records, so we are not going to include this variable in any model

4.2.8 AccessTime

Replacing missing values in Access time

  • There are 97 records where Access time is missing in survey data provided
  • We have to check the distribution of Access time before filling the missing value - So histogram
  • We also have Airport_Airline_data -> raw data which can be used to treat missing values
# Check distribution of AccessTime
hist(dat1$AccessTime)

Replace missing values in Access time

dat1 = dat1 %>%
mutate(AccessTime = ifelse(is.na(dat1$AccessTime),
                             `Travel Time, minutes`,
                             AccessTime))
  • Access time in raw data file is given in the same units (Minutes), so we use that same value to treat missing, irrespective of no of transport modes used
  • After this operation we are left with no records where this Access time is missing

4.2.8.1 Mileage

Replacing missing values in Mileage

  • There are 398 records where Mileage is missing in survey data provided
  • We have to check the distribution of Mileage before filling the missing value - So histogram
  • We also have Airport_Airline_data -> raw data which can be used to treat missing values
# Check distribution of Mileage
hist(dat1$Mileage)

  • We dont have any useful information for this parameter in Airport_Airline_data -> raw data so these missings can not be treated
  • Also, 398 is a almost 80% of total records, so we are not going to include this variable in any model

4.2.9 Replace remaining NAs by 999 for purpose of EDA

# There are only 25 rows without NA, so instead of removing NAs replace it with 999
dat1[is.na(dat1)] <- 999

4.3 Adjust Dataset

4.3.1 Removing unnecessary columns

  • Remove ID column and Flight No column as they do not provide any benefit in this analysis because of their cardinal
  • Remove Departure Hr and Min as information is already captured in Departure Time
  • Remove Occupation - Not relevant for this study
  • Remove Mileage - 398 missing values
  • Remove MileageAirline - Missing values or unknown values
  • Remove AccessCost, Income - Lot of missing values
  • Remove FlyingCompanion, NoTripsLastYear - Not relevant for this study
# Remove ID column and Flight No column as they do not provide any benefit in this analysis because of their cardinal
dat1 <- dat1 %>%
  select(-c(FlightNo, ID))

# Remove Departure Hr and Min as information is already captured in Departure Time
dat1 <- dat1 %>%
  select(-c(DepartureHr, DepartureMn))

# Remove Occupation - Not relevant for this study
dat1 <- dat1 %>%
  select(-c(Occupation))

# Remove Mileage - 398 missing values
# Remove MileageAirline - Missing values or unknown values
dat1 <- dat1 %>%
  select(-c(Mileage, MileageAirline))

# Remove AccessCost - Lot of missing values
# Remove Income - missing values
dat1 <- dat1 %>%
  select(-c(Income))

# Not Relevant for our analysis
dat1 <- dat1 %>%
  select(-c(FlyingCompanion, NoTripsLastYear))

4.3.2 Getting the modified dataset as airport_survey_data

# Modify data
airport_survey_data <- dat1 %>%
  mutate(Airport = as.factor(Airport),
         Airline = as.factor(Airline),
         Nationality = as.factor(Nationality),
         TripPurpose = as.factor(TripPurpose),
         Province = as.factor(Province),
         Destination = as.factor(Destination),
         DepartureTime = as.factor(DepartureTime),
         SeatClass = as.factor(SeatClass),
         NoTransport = as.factor(NoTransport),
         ModeTransport = as.factor(ModeTransport),
         Gender = as.factor(Gender)) 
summary(airport_survey_data)
##  Airport               Airline         Age                  Gender   
##  GMP:245   Asiana Airlines :106   Min.   :17.00   Female       :249  
##  ICN:229   Foreign Airlines:137   1st Qu.:29.00   Male         :222  
##            Korean Air      :150   Median :37.00   Not Mentioned:  3  
##            Korean LCC      : 81   Mean   :39.83                      
##                                   3rd Qu.:50.00                      
##                                   Max.   :80.00                      
##                                                                      
##          Nationality    TripPurpose   TripDuration     GroupTravel    
##  China         : 32   Business:102   Min.   :  0.00   Min.   :0.0000  
##  Japan         : 40   Leisure :317   1st Qu.:  4.00   1st Qu.:0.0000  
##  Korea         :368   Other   : 29   Median :  5.00   Median :0.0000  
##  Other         : 14   Study   : 26   Mean   : 24.59   Mean   :0.1624  
##  Southeast Asia: 20                  3rd Qu.:  8.00   3rd Qu.:0.0000  
##                                      Max.   :730.00   Max.   :1.0000  
##                                                                       
##  FrequentFlightDestination         Destination     DepartureTime
##  Length:474                999           :  2   12pm - 6pm:205  
##  Class :character          China         :131   6am - 12pm: 46  
##  Mode  :character          Japan         :155   6pm - 9pm :194  
##                            Other         : 22   9pm - 6am : 29  
##                            Southeast Asia:164                   
##                                                                 
##                                                                 
##        SeatClass      Airfare       NoTransport      ModeTransport
##  999        :  4   Min.   :  3.00   1:339       Other       :  2  
##  Business   : 34   1st Qu.: 36.22   2:120       Private Car :106  
##  Economy    :424   Median : 46.83   3: 14       Private Taxi:176  
##  First Class: 12   Mean   : 61.20   4:  1       Public Bus  : 50  
##                    3rd Qu.: 58.00               Public Train: 39  
##                    Max.   :999.00               Subway      :101  
##                                                                   
##    AccessCost       AccessTime        Distance                 Province  
##  Min.   :     0   Min.   :  4.00   Min.   : 23.72    Seoul         :182  
##  1st Qu.:   999   1st Qu.: 30.00   1st Qu.: 46.00    Kyungki-do    :126  
##  Median :  1300   Median : 50.00   Median : 61.00    Jeju          : 89  
##  Mean   :  6785   Mean   : 62.36   Mean   :154.93    Kyungsang-do  : 27  
##  3rd Qu.:  8000   3rd Qu.: 64.00   3rd Qu.:363.00    Incheon       : 22  
##  Max.   :350000   Max.   :390.00   Max.   :455.00    Chungcheong-do: 13  
##                                                     (Other)        : 15  
##  Travel Time, minutes Average Price        a_fare      
##  Min.   : 39          Min.   :   999   Min.   : 14.00  
##  1st Qu.: 56          1st Qu.:   999   1st Qu.: 42.17  
##  Median : 64          Median :   999   Median : 46.83  
##  Mean   :103          Mean   :166373   Mean   : 61.20  
##  3rd Qu.:180          3rd Qu.:362160   3rd Qu.: 52.72  
##  Max.   :311          Max.   :529260   Max.   :999.00  
## 
attach(airport_survey_data)

4.3.3 Observations

  • 249 surveyed people travelled via GMP and 239 travelled via ICN
  • Maximum surveyed people travelled via Korean Air i.e. count is 153
  • Average age of surveyed people is 40 years
  • Maximum surveyed people have maximum trip purpose as Leisure
  • Average trip duration of surveyed people is 27.44
  • Maximum surveyed people travelled to Southeast Asia
  • Maximum surveyed people are from Seoul Province
  • Average travel time is 105 hours

4.3.4 Export dataset to be used for modeling in Python

# Selecting relevant columns for further developement of models in Python for analysis

# airport_choice <- airport_survey_data %>%
#    select(Airport, Airline,AccessTime, DepartureTime,
#          Age,Destination,Distance,AccessCost)
# 
# airline_choice <- airport_survey_data %>%
#   select(Airline,Airport,Airfare,DepartureTime,Destination)

# write.csv(airport_choice, "../code/airport_choice.csv")
# write.csv(airline_choice, "../code/airline_choice.csv")
write.csv(airport_survey_data, "../code/airport_survey_data.csv")

5 Basic EDA

5.1 Univariate non-graphical – Categorical

# Function to get the relevant grouped data frames
# df -> dataframe
# groupVariable -> the group by column 
# output -> the resulting data frame
getGroupedUnivariateData <- function(df, groupVariable) {
  # group_by() grouped variable
  # summarise() reduces variable to descriptive stat
  # count is the new variable name, n() is a counting function
  # percent is a new variable, sum() and nrow() are functions
  # tot_selling_price -> total selling price per grouped variable
  # order_by() tot_selling_price
  result <- df %>%             
    group_by_(groupVariable) %>%     
    summarise(count = n(),   
              percent = (sum(count) / nrow(df)) * 100,
              avg_airfare = (mean(Airfare)),
              avg_accessCost = (mean(AccessCost)),
              avg_accessTime = (mean(AccessTime))) %>%
    arrange(desc(percent)) %>%
    as.data.frame() 
  
  return(result)
}

5.1.1 Preferred Airport

# count, average access cost and average access time by airport
airport_data_airport_type <- getGroupedUnivariateData(airport_survey_data, "Airport")

airport_data_airport_type %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airport count percent avg_airfare avg_accessCost avg_accessTime
GMP 245 51.68776 53.56583 6207.988 53.94694
ICN 229 48.31224 69.37374 7401.502 71.35808

Observations

  • Average airfare and average access cost from ICN airport is more than that of GMP airport, for surveyed people.

5.1.2 Preferred Airline

# count, average access cost and average access time by airline
airport_data_airline_type <- getGroupedUnivariateData(airport_survey_data, "Airline")

airport_data_airline_type %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airline count percent avg_airfare avg_accessCost avg_accessTime
Korean Air 150 31.64557 56.99893 8698.613 62.55333
Foreign Airlines 137 28.90295 75.49061 5087.482 56.16788
Asiana Airlines 106 22.36287 54.11425 7168.349 69.98113
Korean LCC 81 17.08861 54.09939 5608.383 62.49383

Observations

  • Average airfare is maximum for ‘other airlines’ (NA)

5.1.3 Gender

# count, average access cost and average access time by gender
airport_data_gender <- getGroupedUnivariateData(airport_survey_data, "Gender")

airport_data_gender %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Gender count percent avg_airfare avg_accessCost avg_accessTime
Female 249 52.5316456 59.46850 5906.663 63.71084
Male 222 46.8354430 63.40278 7790.279 60.62162
Not Mentioned 3 0.6329114 42.38000 5233.333 78.66667

Observations

  • Average airfare is minimum for Female

5.1.4 Nationality

# count, average access cost and average access time by airport
airport_data_nationality <- getGroupedUnivariateData(airport_survey_data, "Nationality")

airport_data_nationality %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Nationality count percent avg_airfare avg_accessCost avg_accessTime
Korea 368 77.637131 64.72743 6943.886 59.80707
Japan 40 8.438819 53.48309 5474.450 57.00000
China 32 6.751055 48.24279 5540.250 66.06250
Southeast Asia 20 4.219409 47.33056 9749.700 100.90000
Other 14 2.953587 40.05821 4949.357 81.21429
  • 77% people in the survey are from Korea

5.1.5 Trip Purpose

# count, average access cost and average access time by airport
airport_data_TripPurpose <- getGroupedUnivariateData(airport_survey_data, "TripPurpose")

airport_data_TripPurpose %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
TripPurpose count percent avg_airfare avg_accessCost avg_accessTime
Leisure 317 66.877637 63.29966 5232.372 61.24290
Business 102 21.518987 62.16415 11337.343 65.51961
Other 29 6.118144 45.29150 9199.724 56.37931
Study 26 5.485232 49.61629 5155.308 70.23077
  • Approximate 66% people are traveling for leisure
  • More than 21% people are travelling for business purposes

5.1.6 Residence

# count, average access cost and average access time by destination
airport_data_Residence <- getGroupedUnivariateData(airport_survey_data, "Province")

airport_data_Residence %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Province count percent avg_airfare avg_accessCost avg_accessTime
Seoul 182 38.396625 56.23822 4293.868 43.39011
Kyungki-do 126 26.582279 59.49765 7224.571 55.61905
Jeju 89 18.776371 81.42943 8015.191 74.59551
Kyungsang-do 27 5.696203 64.06941 21316.519 141.85185
Incheon 22 4.641350 45.64090 1919.773 33.59091
Chungcheong-do 13 2.742616 42.61244 7492.000 78.46154
Jeonra-do 10 2.109705 51.00000 3219.200 188.50000
Kangwon-do 5 1.054852 46.60000 12680.000 108.00000
  • More than 60% people are from Seoul and Kyungki-do province, which is near to GMP airport as compared to ICN
  • Approximate 18% people are from Jeju province, which is almost equi-distant from both the airports

5.1.7 Destination

# count, average access cost and average access time by destination
airport_data_destination <- getGroupedUnivariateData(airport_survey_data, "Destination")

airport_data_destination %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Destination count percent avg_airfare avg_accessCost avg_accessTime
Southeast Asia 164 34.5991561 73.16723 8440.372 67.41463
Japan 155 32.7004219 51.03942 6089.342 55.20000
China 131 27.6371308 46.39340 5782.359 64.89313
Other 22 4.6413502 134.18831 4970.136 60.22727
999 2 0.4219409 35.00000 10500.000 60.00000
  • Southeast Asia, Japan and China are the famous destinations among survey participants

Observations

  • Other airlines have highest average airfare

5.1.8 Preferred Departure Time

# count, average access cost and average access time by province residence
airport_data_deptime <- getGroupedUnivariateData(airport_survey_data, "DepartureTime")

airport_data_deptime %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
DepartureTime count percent avg_airfare avg_accessCost avg_accessTime
12pm - 6pm 205 43.248945 52.91412 5753.839 54.15122
6pm - 9pm 194 40.928270 64.80852 8519.015 72.23196
6am - 12pm 46 9.704641 88.91833 5789.913 55.00000
9pm - 6am 29 6.118144 51.71462 4046.172 66.00000
  • As per the survey data, most people prefer departure time during the day. 12pm – 6pm – 9pm

5.1.9 Preferred Seat Class

# count, average access cost and average access time by seat class
airport_data_seatClass <- getGroupedUnivariateData(airport_survey_data, "SeatClass")

airport_data_seatClass %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
SeatClass count percent avg_airfare avg_accessCost avg_accessTime
Economy 424 89.4514768 50.62049 6758.300 61.88915
Business 34 7.1729958 86.23899 7811.294 68.64706
First Class 12 2.5316456 293.16667 6650.000 66.50000
999 4 0.8438819 274.25000 1249.500 46.25000

Observations

  • Average airfare and access cost is highest for Business class seats
  • Minimum average airfare for Incheon Province

Observations

5.1.10 Preferred Mode of Transport

# count, average access cost and average access time by province residence
airport_data_modeTransport <- getGroupedUnivariateData(airport_survey_data, "ModeTransport")

airport_data_modeTransport %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
ModeTransport count percent avg_airfare avg_accessCost avg_accessTime
Private Taxi 176 37.1308017 54.75351 10959.472 53.63068
Private Car 106 22.3628692 61.15355 2261.434 59.04717
Subway 101 21.3080169 62.00375 1521.762 51.53465
Public Bus 50 10.5485232 65.36405 2970.780 60.36000
Public Train 39 8.2278481 83.70869 17156.051 139.92308
Other 2 0.4219409 48.05000 37999.500 90.00000

Observations

  • Agency Shuttles provide the minimum average airfare
  • Car is the most preferred way of Transport Mode to the airport

5.2 Univariate graphical code – Categorical

# Code histograms using grid.arrange so can see all quant variables together
grid.arrange(
  airport_data_airport_type %>% 
    ggplot(aes(x = reorder(Airport, -count), y = count)) +  
    xlab("Airport") + 
    theme_classic() +
      geom_bar(stat = "identity", fill = "darkolivegreen3"),
  airport_data_airline_type %>% 
    ggplot(aes(x = reorder(Airline, -count), y = count)) +
    xlab("Airline") +
    theme_classic() +
      geom_bar(stat = "identity", fill = "darkolivegreen3"),
  airport_data_gender %>% 
    ggplot(aes(x = reorder(Gender, -count), y = count)) +
    xlab("Gender") +
    theme_classic() +
      geom_bar(stat = "identity", fill = "darkolivegreen3"),
  airport_data_seatClass %>% 
    ggplot(aes(x = reorder(SeatClass, -count), y = count)) +
    xlab("SeatClass") +
    theme_classic() +
      geom_bar(stat = "identity", fill = "darkolivegreen3"),
  airport_data_Residence %>% 
    ggplot(aes(x = reorder(Province, -count), y = count)) +
    xlab("Province") +
    theme_classic() +
      geom_bar(stat = "identity", fill = "darkolivegreen3"),
  airport_data_destination %>% 
    ggplot(aes(x = reorder(Destination, -count), y = count)) +
    xlab("Destination") +
    theme_classic() +
      geom_bar(stat = "identity", fill = "darkolivegreen3"),
  ncol = 2
) 

Observations

  • GMP and ICN have almost equal number of respondents
  • Korean airline is the most preferred
  • Females are more frequent travellers
  • Economy class is the most preferred Seat Class
  • Maximum number of passengers belong to Seoul Province
  • Southeast Asia is the most travelled destination

5.3 Univariate non-graphical – Quantitative

5.3.1 Descriptive Statistics

# Get descriptive stats 
summary(airport_survey_data)
##  Airport               Airline         Age                  Gender   
##  GMP:245   Asiana Airlines :106   Min.   :17.00   Female       :249  
##  ICN:229   Foreign Airlines:137   1st Qu.:29.00   Male         :222  
##            Korean Air      :150   Median :37.00   Not Mentioned:  3  
##            Korean LCC      : 81   Mean   :39.83                      
##                                   3rd Qu.:50.00                      
##                                   Max.   :80.00                      
##                                                                      
##          Nationality    TripPurpose   TripDuration     GroupTravel    
##  China         : 32   Business:102   Min.   :  0.00   Min.   :0.0000  
##  Japan         : 40   Leisure :317   1st Qu.:  4.00   1st Qu.:0.0000  
##  Korea         :368   Other   : 29   Median :  5.00   Median :0.0000  
##  Other         : 14   Study   : 26   Mean   : 24.59   Mean   :0.1624  
##  Southeast Asia: 20                  3rd Qu.:  8.00   3rd Qu.:0.0000  
##                                      Max.   :730.00   Max.   :1.0000  
##                                                                       
##  FrequentFlightDestination         Destination     DepartureTime
##  Length:474                999           :  2   12pm - 6pm:205  
##  Class :character          China         :131   6am - 12pm: 46  
##  Mode  :character          Japan         :155   6pm - 9pm :194  
##                            Other         : 22   9pm - 6am : 29  
##                            Southeast Asia:164                   
##                                                                 
##                                                                 
##        SeatClass      Airfare       NoTransport      ModeTransport
##  999        :  4   Min.   :  3.00   1:339       Other       :  2  
##  Business   : 34   1st Qu.: 36.22   2:120       Private Car :106  
##  Economy    :424   Median : 46.83   3: 14       Private Taxi:176  
##  First Class: 12   Mean   : 61.20   4:  1       Public Bus  : 50  
##                    3rd Qu.: 58.00               Public Train: 39  
##                    Max.   :999.00               Subway      :101  
##                                                                   
##    AccessCost       AccessTime        Distance                 Province  
##  Min.   :     0   Min.   :  4.00   Min.   : 23.72    Seoul         :182  
##  1st Qu.:   999   1st Qu.: 30.00   1st Qu.: 46.00    Kyungki-do    :126  
##  Median :  1300   Median : 50.00   Median : 61.00    Jeju          : 89  
##  Mean   :  6785   Mean   : 62.36   Mean   :154.93    Kyungsang-do  : 27  
##  3rd Qu.:  8000   3rd Qu.: 64.00   3rd Qu.:363.00    Incheon       : 22  
##  Max.   :350000   Max.   :390.00   Max.   :455.00    Chungcheong-do: 13  
##                                                     (Other)        : 15  
##  Travel Time, minutes Average Price        a_fare      
##  Min.   : 39          Min.   :   999   Min.   : 14.00  
##  1st Qu.: 56          1st Qu.:   999   1st Qu.: 42.17  
##  Median : 64          Median :   999   Median : 46.83  
##  Mean   :103          Mean   :166373   Mean   : 61.20  
##  3rd Qu.:180          3rd Qu.:362160   3rd Qu.: 52.72  
##  Max.   :311          Max.   :529260   Max.   :999.00  
## 

Observations

5.4 Univariate graphical – Quantitative

5.4.1 Histograms

# Code histograms using ggplot() so we can see all quant variables

grid.arrange(
  airport_survey_data %>% 
  ggplot(aes(Age)) +
    geom_histogram(fill = "darkolivegreen3") +
  xlab("Age (in years)")+
  theme_classic(),
  
  airport_survey_data %>% 
  ggplot(aes(Airfare)) +
    geom_histogram(fill = "darkolivegreen3") +
  xlab("Airfare (in 10,000 KRW)")+
  theme_classic(),
  
  airport_survey_data %>% 
  ggplot(aes(AccessCost/1000)) +
    geom_histogram(fill = "darkolivegreen3") +
  xlab("Transportation Cost to the Airport (in 1000 KRW)") +
  theme_classic() +
  xlim(0, 100) +
  ylim(0, 100),
  
  airport_survey_data %>% 
  ggplot(aes(AccessTime)) +
    geom_histogram(fill = "darkolivegreen3") +
  xlab("Transportation Time to the Airport (in minutes)")+
  theme_classic()
)

5.4.2 Observations

5.4.3 Boxplots of quantitative data

# get box plot of all the quanitative variables
par(mfrow = c(2, 2))
boxplot(Age, main = "Age (in years)")
boxplot(Airfare, main = "Airfare (in 10,000 KRW)")
boxplot(AccessCost/1000, main = "Transportation Cost to Airport(1000KRW)")
boxplot(AccessTime, main = "Transportation Time to Airport(in min)")

5.4.4 Observations

5.5 Multivariate non-graphical – Categorical

5.5.1 Cross-Tabs with Airport

5.5.1.1 Airport and Airlines Count

# Frequency count
dat1 %>%
  tabyl(Airport, Airline)
##  Airport Asiana Airlines Foreign Airlines Korean Air Korean LCC
##      GMP              52               93         76         24
##      ICN              54               44         74         57
# Make cross-tabs between two categorical variables using janitor package
# cross-tab of bedrooms and floors
airport_survey_data %>% 
  tabyl(Airline, Airport) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airport
Airline GMP ICN Total
Asiana Airlines 10.97% 11.39% 22.36%
Foreign Airlines 19.62% 9.28% 28.90%
Korean Air 16.03% 15.61% 31.65%
Korean LCC 5.06% 12.03% 17.09%
Total 51.69% 48.31% 100.00%

5.5.1.2 Airport and Gender Count

# Frequency count
dat1 %>%
  tabyl(Airport, Gender)
##  Airport Female Male Not Mentioned
##      GMP    134  110             1
##      ICN    115  112             2
# cross-tab of bedrooms and waterfront

airport_survey_data %>% 
  tabyl(Gender, Airport) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2)%>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airport
Gender GMP ICN Total
Female 28.27% 24.26% 52.53%
Male 23.21% 23.63% 46.84%
Not Mentioned 0.21% 0.42% 0.63%
Total 51.69% 48.31% 100.00%
  • Females prefer GMP Airport
  • Males have almost equal percent of travelling via GMP and ICN airports

5.5.1.3 Airport and Nationality Count

# Frequency count
dat1 %>%
  tabyl(Airport, Nationality)
##  Airport China Japan Korea Other Southeast Asia
##      GMP    26    39   171     7              2
##      ICN     6     1   197     7             18

5.5.1.4 Airport and Province Residence Count

# Frequency count
dat1 %>%
  tabyl(Airport, Province)
##  Airport  Chungcheong-do  Incheon  Jeju  Jeonra-do  Kangwon-do  Kyungki-do
##      GMP               6       11    69          5           3          59
##      ICN               7       11    20          5           2          67
##   Kyungsang-do  Seoul
##              5     87
##             22     95
# cross-tab of bedrooms and condition
airport_survey_data %>% 
  tabyl(Province, Airport) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airport
Province GMP ICN Total
Chungcheong-do 1.27% 1.48% 2.74%
Incheon 2.32% 2.32% 4.64%
Jeju 14.56% 4.22% 18.78%
Jeonra-do 1.05% 1.05% 2.11%
Kangwon-do 0.63% 0.42% 1.05%
Kyungki-do 12.45% 14.14% 26.58%
Kyungsang-do 1.05% 4.64% 5.70%
Seoul 18.35% 20.04% 38.40%
Total 51.69% 48.31% 100.00%
  • ICN Airport is preferred by Seoul Province
  • GMP Airport is preferred by Jeju Province

5.5.1.5 Airport and Destination Count

# Frequency count
dat1 %>%
  tabyl(Airport, Destination)
##  Airport 999 China Japan Other Southeast Asia
##      GMP   2    87   130     4             22
##      ICN   0    44    25    18            142
# Make cross-tabs between two categorical variables using janitor package
# cross-tab of Airport and Destinationn
airport_survey_data %>% 
  tabyl(Destination, Airport) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2)%>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airport
Destination GMP ICN Total
999 0.42% 0.00% 0.42%
China 18.35% 9.28% 27.64%
Japan 27.43% 5.27% 32.70%
Other 0.84% 3.80% 4.64%
Southeast Asia 4.64% 29.96% 34.60%
Total 51.69% 48.31% 100.00%

5.5.1.6 Airport and Frequent flight destination

# cross-tab of freq flight destination and Airport
airport_survey_data %>% 
  tabyl(FrequentFlightDestination, Airport) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Airport
FrequentFlightDestination GMP ICN Total
999 2.53% 4.01% 6.54%
China 16.24% 5.27% 21.52%
Europe 1.90% 3.59% 5.49%
Japan 18.99% 4.22% 23.21%
None 1.05% 1.48% 2.53%
North/South America 1.27% 1.27% 2.53%
Other 3.80% 2.53% 6.33%
Southeast Asia 5.91% 25.95% 31.86%
Total 51.69% 48.31% 100.00%
  • With approximately 19%, Japan is the frequent flight destination from GMP

5.5.1.7 Airport and Departure Time Count

# Frequency count
dat1 %>%
  tabyl(Airport, DepartureTime)
##  Airport 12pm - 6pm 6am - 12pm 6pm - 9pm 9pm - 6am
##      GMP        135         35        74         1
##      ICN         70         11       120        28

5.5.2 Cross-Tabs with Airline

5.5.2.1 Airline and Gender Count

# Frequency count
dat1 %>%
  tabyl(Airline, Gender)
##           Airline Female Male Not Mentioned
##   Asiana Airlines     52   53             1
##  Foreign Airlines     71   66             0
##        Korean Air     77   73             0
##        Korean LCC     49   30             2
# cross-tab of Airline and Gender
airport_survey_data %>% 
  tabyl(Airline, Gender) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Gender
Airline Female Male Not Mentioned Total
Asiana Airlines 10.97% 11.18% 0.21% 22.36%
Foreign Airlines 14.98% 13.92% 0.00% 28.90%
Korean Air 16.24% 15.40% 0.00% 31.65%
Korean LCC 10.34% 6.33% 0.42% 17.09%
Total 52.53% 46.84% 0.63% 100.00%

5.5.2.2 Airline and Nationality Count

# Frequency count
dat1 %>%
  tabyl(Airline, Nationality)
##           Airline China Japan Korea Other Southeast Asia
##   Asiana Airlines     5     8    84     4              5
##  Foreign Airlines    18    20    90     4              5
##        Korean Air     9    10   123     4              4
##        Korean LCC     0     2    71     2              6
# cross-tab of Airline and Nationality
airport_survey_data %>% 
  tabyl(Airline, Nationality) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Nationality
Airline China Japan Korea Other Southeast Asia Total
Asiana Airlines 1.05% 1.69% 17.72% 0.84% 1.05% 22.36%
Foreign Airlines 3.80% 4.22% 18.99% 0.84% 1.05% 28.90%
Korean Air 1.90% 2.11% 25.95% 0.84% 0.84% 31.65%
Korean LCC 0.00% 0.42% 14.98% 0.42% 1.27% 17.09%
Total 6.75% 8.44% 77.64% 2.95% 4.22% 100.00%

5.5.2.3 Airline and Nationality Count

# Frequency count
dat1 %>%
  tabyl(Airline, Destination)
##           Airline 999 China Japan Other Southeast Asia
##   Asiana Airlines   0    32    47     3             24
##  Foreign Airlines   0    46    53    10             28
##        Korean Air   2    53    48     9             38
##        Korean LCC   0     0     7     0             74
# cross-tab of Airline and Destination
airport_survey_data %>% 
  tabyl(Airline, Destination) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Destination
Airline 999 China Japan Other Southeast Asia Total
Asiana Airlines 0.00% 6.75% 9.92% 0.63% 5.06% 22.36%
Foreign Airlines 0.00% 9.70% 11.18% 2.11% 5.91% 28.90%
Korean Air 0.42% 11.18% 10.13% 1.90% 8.02% 31.65%
Korean LCC 0.00% 0.00% 1.48% 0.00% 15.61% 17.09%
Total 0.42% 27.64% 32.70% 4.64% 34.60% 100.00%

5.5.2.4 Airline and DepartureTime Count

# Frequency count
dat1 %>%
  tabyl(Airline, DepartureTime)
##           Airline 12pm - 6pm 6am - 12pm 6pm - 9pm 9pm - 6am
##   Asiana Airlines         39          4        55         8
##  Foreign Airlines         60          8        61         8
##        Korean Air         89         15        43         3
##        Korean LCC         17         19        35        10
# cross-tab of Airline and DepartureTime
airport_survey_data %>% 
  tabyl(Airline, DepartureTime) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
DepartureTime
Airline 12pm - 6pm 6am - 12pm 6pm - 9pm 9pm - 6am Total
Asiana Airlines 8.23% 0.84% 11.60% 1.69% 22.36%
Foreign Airlines 12.66% 1.69% 12.87% 1.69% 28.90%
Korean Air 18.78% 3.16% 9.07% 0.63% 31.65%
Korean LCC 3.59% 4.01% 7.38% 2.11% 17.09%
Total 43.25% 9.70% 40.93% 6.12% 100.00%

5.5.3 Trip Purpose and Destination

# cross-tab of floors and waterfront
airport_survey_data %>% 
  tabyl(TripPurpose, Destination) %>% 
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_title() %>%
   kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(1, background = "#e5f5e0")
Destination
TripPurpose 999 China Japan Other Southeast Asia Total
Business 0.00% 7.81% 9.70% 0.84% 3.16% 21.52%
Leisure 0.21% 14.77% 18.78% 3.38% 29.75% 66.88%
Other 0.21% 2.74% 1.69% 0.21% 1.27% 6.12%
Study 0.00% 2.32% 2.53% 0.21% 0.42% 5.49%
Total 0.42% 27.64% 32.70% 4.64% 34.60% 100.00%

5.5.4 Observations

  • Japan is mostly travelled by the passengers for business purposes
  • Southeast Asia is favorite for Leisure purpose

5.6 Multivariate graphical code – Categorical

#Function for heatmap
 heat_map <- function (x, y,y_title,x_title){
   x<-enquo(x)
   y<-enquo(y)
   
   
   airport_survey_data %>%
     group_by(!!x,!!y) %>%
     summarise(count = n())  %>%
     ggplot(aes(!!x,!!y)) +
     geom_tile(aes(fill = count))+
     ylab(y_title) + xlab(x_title)+
     theme_classic() +
     scale_fill_continuous(guide = guide_legend(title = "Count")) +
     scale_fill_gradient(low = "white", high = "darkolivegreen3")
     
 }

grid.arrange(
heat_map(Airport, Airline, "Airline", "Airport"),
heat_map(Airport, Destination, "Destination", "Airport"),
heat_map(Airport, Province, "Province", "Airport"),
heat_map(TripPurpose, Destination, "Destination", "TripPurpose"),
nrow=2)

heat_map(Airport, TripPurpose, "TripPurpose", "Airport" )

heat_map(Airline, TripPurpose, "TripPurpose", "Airline" )

heat_map(GroupTravel, Airline, "GroupTravel", "Airport" )

heat_map(GroupTravel, Airport, "GroupTravel", "Airline" )

heat_map(Gender, Airline, "Airline", "Gender" )

heat_map(Gender, Airport, "Airport", "Gender" )

ggplot(dat1, aes(x=Age, y=Airline)) +
  geom_point(size=2, shape=23)

ggplot(dat1, aes(x=Age, y=Airport)) +
  geom_point(size=2, shape=23)

ggplot(dat1, aes(x=Age, y=Gender, color=Airport)) +
  geom_point(size=2, shape=23)

ggplot(dat1, aes(x=Age, y=Gender, color=Airline)) +
  geom_point(size=2, shape=23)

ggplot(dat1, aes(x=Destination, y=Airfare, color=Airline)) +
  geom_point(size=2, shape=23) + ylim(0, 270)
## Warning: Removed 6 rows containing missing values (geom_point).

5.7 Multivariate non-graphical – Quantitative

The standard measure between quantitative variables is correlation

5.7.1 Correlation plots

# Find correlation of quantitative variables
#cor_plot <- dat1 %>%
#  select(-c(FrequentFlightDestination, -MileageAirline))

#cor(cor_plot) 

5.7.2 Observations

5.8 Multivariate graphical code – Quantitative

5.8.1 Pairwise Plot

# Correlation plot
dat1 %>%
  select(Airport, Airline, Age, Gender, TripPurpose, TripDuration, Destination, DepartureTime, Province, Airfare, AccessCost, AccessTime) %>%   
  pairs.panels(hist.col="darkolivegreen3")

6 Detailed EDA

6.1 Airlines vs Airfare

airport_airline_airfare_graph <- ggplot(airport_survey_data, aes(x = Airline, y = Airfare, color = Airport)) + 
geom_point(alpha = 0.5) +
   ylim(0, 300) +
  geom_smooth(method=lm, se=FALSE) +
theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
      plot.subtitle = element_text(face = "italic")) +
labs(x = 'Airlines', y = 'Airfare  (10,000 Korean Won, KRW)', title = "ICN airport has higher airfares") +
  scale_fill_brewer(palette = "blue")

 airport_airline_airfare_graph

airport_airline_bargraph <- airport_survey_data %>%
  group_by(Airport, Airline) %>%
  summarize (Avg_Fare = mean(Airfare)) 

ggplot(data = airport_airline_bargraph, aes(x = Airline, y = Avg_Fare, fill = Airport)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
    scale_fill_brewer() +
  labs(x = 'Airlines', y = 'Airfare  (10,000 Korean Won, KRW)', title = "ICN airport has higher average airfares") +
  theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
      plot.subtitle = element_text(face = "italic"))

airport_gender_age_graph <- ggplot(airport_survey_data, aes(x = Gender, y = Age, color = Airport)) + 
geom_point(alpha = 0.5) +
   ylim(0, 85) +
  geom_smooth(method=lm, se=FALSE, color="black") +
theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
      plot.subtitle = element_text(face = "italic")) +
labs(x = 'Gender', y = 'Age (Years)', title = "Older people prefer GMP airport") +
guides(color = guide_legend(title = '  Airports')) +
  scale_fill_brewer()

 airport_gender_age_graph

airport_gender_age_bargraph <- airport_survey_data %>%
  group_by(Airport, Gender) %>%
  summarize (Avg_age = mean(Age)) 

ggplot(data = airport_gender_age_bargraph, aes(x = Gender, y = Avg_age, fill = Airport)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
     scale_fill_brewer()  +
  labs(x = 'Gender', y = 'Age (Years)', title = "Among surveyed, average age of people going to GMP is higher") +
  theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
      plot.subtitle = element_text(face = "italic"))

airport_destination_bargraph <- airport_survey_data %>%
  group_by(Airport, Destination) %>%
  summarize (Count = n())

ggplot(data = airport_destination_bargraph, aes(x = Destination, y = Count, fill = Airport)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
     scale_fill_brewer()  +
  labs(x = 'Destination', y = 'Count', title = "People prefer GMP airport while travelling to Japan and China", subtitle = "People prefer ICN airport while travelling to South East Asia and other airports") +
  theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
      plot.subtitle = element_text(face = "italic"))

airport_province_bargraph <- airport_survey_data %>%
  group_by(Airport, Province) %>%
  summarize (Count = n())

ggplot(data = airport_province_bargraph, aes(x = Province, y = Count, fill = Airport)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
     scale_fill_brewer()  +
  labs(x = 'Province', y = 'Count', title = "More travelled via GMP from Seoul,Kyungsang-do,Kyungki-do,Chungcheong-do", subtitle = "More travelled via ICN from Kangwon-do and Jeju") +
  theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0.6, color = 'black'),
      plot.subtitle = element_text(face = "italic"))

airport_transport_travelTime_graph <- airport_survey_data %>%
  group_by(Airport, ModeTransport) %>%
  summarize (Average_Travel_Time = mean(`Travel Time, minutes`))

ggplot(data = airport_transport_travelTime_graph, aes(x = ModeTransport, y = Average_Travel_Time, fill = Airport)) +
  #geom_bar(stat="identity", position="dodge") +
  geom_bar(stat = "identity", position = position_dodge2(preserve = "single", padding = 0)) +
  coord_flip() +
      scale_fill_brewer() +
  labs(x = 'Mode of Transport', y = 'Average Travel Time (in minutes)', title = "Average Travel Time to ICN is more") +
  theme_classic()+
theme(legend.title = element_text(size=10),
      plot.title = element_text(hjust = 0.6, face = 'bold', color = 'black'),
      plot.subtitle = element_text(face = "italic"))

ggplot(dat1, aes(fill=Airline, y=Airfare, x=Airline)) + 
    geom_bar(position="dodge", stat="summary", fun.y="mean")

6.1.1 Observations

7 Statistical EDA

7.1 Chi-squared test

7.1.1 Airport and Airline

chisq.test(table(airport_survey_data$Airport, airport_survey_data$Airline))
## 
##  Pearson's Chi-squared test
## 
## data:  table(airport_survey_data$Airport, airport_survey_data$Airline)
## X-squared = 30.529, df = 3, p-value = 1.068e-06

7.1.2 Airport and Age

chisq.test(table(airport_survey_data$Airport, airport_survey_data$Age))
## 
##  Pearson's Chi-squared test
## 
## data:  table(airport_survey_data$Airport, airport_survey_data$Age)
## X-squared = 77.248, df = 57, p-value = 0.03835

7.1.3 Airport and Destination

chisq.test(table(airport_survey_data$Airport, airport_survey_data$Destination))
## 
##  Pearson's Chi-squared test
## 
## data:  table(airport_survey_data$Airport, airport_survey_data$Destination)
## X-squared = 183.63, df = 4, p-value < 2.2e-16

7.1.4 Airport and Airfare

chisq.test(table(airport_survey_data$Airport, airport_survey_data$Airfare))
## 
##  Pearson's Chi-squared test
## 
## data:  table(airport_survey_data$Airport, airport_survey_data$Airfare)
## X-squared = 237.84, df = 90, p-value = 2.739e-15

7.2 t-test

t.test(airport_survey_data$Airfare[airport_survey_data$Airport == "ICN"], airport_survey_data$Airfare[airport_survey_data$Airport == "GMP"])
## 
##  Welch Two Sample t-test
## 
## data:  airport_survey_data$Airfare[airport_survey_data$Airport == "ICN"] and airport_survey_data$Airfare[airport_survey_data$Airport == "GMP"]
## t = 1.5608, df = 401.61, p-value = 0.1194
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.102883 35.718697
## sample estimates:
## mean of x mean of y 
##  69.37374  53.56583
t.test(airport_survey_data$Age[airport_survey_data$Airport == "ICN"], airport_survey_data$Age[airport_survey_data$Airport == "GMP"])
## 
##  Welch Two Sample t-test
## 
## data:  airport_survey_data$Age[airport_survey_data$Airport == "ICN"] and airport_survey_data$Age[airport_survey_data$Airport == "GMP"]
## t = -3.1916, df = 467.59, p-value = 0.00151
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.338893 -1.507751
## sample estimates:
## mean of x mean of y 
##  37.79913  41.72245
t.test(airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport == "ICN"], airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport == "GMP"])
## 
##  Welch Two Sample t-test
## 
## data:  airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport ==  and airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport ==     "ICN"] and     "GMP"]
## t = 0.82831, df = 454.09, p-value = 0.4079
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.255885 20.285900
## sample estimates:
## mean of x mean of y 
##  106.1048  100.0898

7.3 ANOVA test

# Compute the analysis of variance
res.aov <- aov(Airfare ~ Airport + Airline + Gender + Destination, data = airport_survey_data)
# Summary of the analysis
summary(res.aov)
##              Df  Sum Sq Mean Sq F value Pr(>F)   
## Airport       1   29578   29578   2.553 0.1107   
## Airline       3   59951   19984   1.725 0.1610   
## Gender        2    1159     580   0.050 0.9512   
## Destination   4  170079   42520   3.670 0.0059 **
## Residuals   463 5363580   11584                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.3.1 Preparing the Data

summary(dat1)
##    Airport            Airline               Age           Gender         
##  Length:474         Length:474         Min.   :17.00   Length:474        
##  Class :character   Class :character   1st Qu.:29.00   Class :character  
##  Mode  :character   Mode  :character   Median :37.00   Mode  :character  
##                                        Mean   :39.83                     
##                                        3rd Qu.:50.00                     
##                                        Max.   :80.00                     
##                                                                          
##  Nationality        TripPurpose         TripDuration     GroupTravel    
##  Length:474         Length:474         Min.   :  0.00   Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:  4.00   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :  5.00   Median :0.0000  
##                                        Mean   : 24.59   Mean   :0.1624  
##                                        3rd Qu.:  8.00   3rd Qu.:0.0000  
##                                        Max.   :730.00   Max.   :1.0000  
##                                                                         
##  FrequentFlightDestination Destination        DepartureTime     
##  Length:474                Length:474         Length:474        
##  Class :character          Class :character   Class :character  
##  Mode  :character          Mode  :character   Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##   SeatClass            Airfare        NoTransport    ModeTransport     
##  Length:474         Min.   :  3.00   Min.   :1.000   Length:474        
##  Class :character   1st Qu.: 36.22   1st Qu.:1.000   Class :character  
##  Mode  :character   Median : 46.83   Median :1.000   Mode  :character  
##                     Mean   : 61.20   Mean   :1.319                     
##                     3rd Qu.: 58.00   3rd Qu.:2.000                     
##                     Max.   :999.00   Max.   :4.000                     
##                                                                        
##    AccessCost       AccessTime        Distance                 Province  
##  Min.   :     0   Min.   :  4.00   Min.   : 23.72    Seoul         :182  
##  1st Qu.:   999   1st Qu.: 30.00   1st Qu.: 46.00    Kyungki-do    :126  
##  Median :  1300   Median : 50.00   Median : 61.00    Jeju          : 89  
##  Mean   :  6785   Mean   : 62.36   Mean   :154.93    Kyungsang-do  : 27  
##  3rd Qu.:  8000   3rd Qu.: 64.00   3rd Qu.:363.00    Incheon       : 22  
##  Max.   :350000   Max.   :390.00   Max.   :455.00    Chungcheong-do: 13  
##                                                     (Other)        : 15  
##  Travel Time, minutes Average Price        a_fare      
##  Min.   : 39          Min.   :   999   Min.   : 14.00  
##  1st Qu.: 56          1st Qu.:   999   1st Qu.: 42.17  
##  Median : 64          Median :   999   Median : 46.83  
##  Mean   :103          Mean   :166373   Mean   : 61.20  
##  3rd Qu.:180          3rd Qu.:362160   3rd Qu.: 52.72  
##  Max.   :311          Max.   :529260   Max.   :999.00  
## 
summary(airport_survey_data)
##  Airport               Airline         Age                  Gender   
##  GMP:245   Asiana Airlines :106   Min.   :17.00   Female       :249  
##  ICN:229   Foreign Airlines:137   1st Qu.:29.00   Male         :222  
##            Korean Air      :150   Median :37.00   Not Mentioned:  3  
##            Korean LCC      : 81   Mean   :39.83                      
##                                   3rd Qu.:50.00                      
##                                   Max.   :80.00                      
##                                                                      
##          Nationality    TripPurpose   TripDuration     GroupTravel    
##  China         : 32   Business:102   Min.   :  0.00   Min.   :0.0000  
##  Japan         : 40   Leisure :317   1st Qu.:  4.00   1st Qu.:0.0000  
##  Korea         :368   Other   : 29   Median :  5.00   Median :0.0000  
##  Other         : 14   Study   : 26   Mean   : 24.59   Mean   :0.1624  
##  Southeast Asia: 20                  3rd Qu.:  8.00   3rd Qu.:0.0000  
##                                      Max.   :730.00   Max.   :1.0000  
##                                                                       
##  FrequentFlightDestination         Destination     DepartureTime
##  Length:474                999           :  2   12pm - 6pm:205  
##  Class :character          China         :131   6am - 12pm: 46  
##  Mode  :character          Japan         :155   6pm - 9pm :194  
##                            Other         : 22   9pm - 6am : 29  
##                            Southeast Asia:164                   
##                                                                 
##                                                                 
##        SeatClass      Airfare       NoTransport      ModeTransport
##  999        :  4   Min.   :  3.00   1:339       Other       :  2  
##  Business   : 34   1st Qu.: 36.22   2:120       Private Car :106  
##  Economy    :424   Median : 46.83   3: 14       Private Taxi:176  
##  First Class: 12   Mean   : 61.20   4:  1       Public Bus  : 50  
##                    3rd Qu.: 58.00               Public Train: 39  
##                    Max.   :999.00               Subway      :101  
##                                                                   
##    AccessCost       AccessTime        Distance                 Province  
##  Min.   :     0   Min.   :  4.00   Min.   : 23.72    Seoul         :182  
##  1st Qu.:   999   1st Qu.: 30.00   1st Qu.: 46.00    Kyungki-do    :126  
##  Median :  1300   Median : 50.00   Median : 61.00    Jeju          : 89  
##  Mean   :  6785   Mean   : 62.36   Mean   :154.93    Kyungsang-do  : 27  
##  3rd Qu.:  8000   3rd Qu.: 64.00   3rd Qu.:363.00    Incheon       : 22  
##  Max.   :350000   Max.   :390.00   Max.   :455.00    Chungcheong-do: 13  
##                                                     (Other)        : 15  
##  Travel Time, minutes Average Price        a_fare      
##  Min.   : 39          Min.   :   999   Min.   : 14.00  
##  1st Qu.: 56          1st Qu.:   999   1st Qu.: 42.17  
##  Median : 64          Median :   999   Median : 46.83  
##  Mean   :103          Mean   :166373   Mean   : 61.20  
##  3rd Qu.:180          3rd Qu.:362160   3rd Qu.: 52.72  
##  Max.   :311          Max.   :529260   Max.   :999.00  
## 

7.3.2 Remove Highly correlated variables

# ProvinceDistance and Access time # Correlated with Province
dat1 <- dat1 %>%
  select(-c(Distance, AccessTime))

# TravelTime, minutes?
  • Province Distance is highly correlated with Provice, so removed from the dataset
  • Access Time, time taken by individual from province to airport is also highly correlated with Provice and distance, so removed from the dataset

7.3.3 Remove unrelated variables

# Remove AccessCost - Lot of missing values
# Remove FrequentFlightDestination
# GroupTravel
# TripDuration
# Gender
# AgeGroup
# SeatClass
# NoTransport

dat1 <- dat1 %>%
  select(-c(AccessCost, FrequentFlightDestination, GroupTravel,
            TripDuration, Gender, SeatClass, NoTransport, TripPurpose,
            'Average Price', a_fare, 'Travel Time, minutes', ModeTransport))

## Remove rows with Destination 999
dat1 <- dat1 %>%
  filter(Destination != "999")

## Remove rows with Airfare 999
dat1 <- dat1 %>%
  filter(Airfare != "999")
  • Remove AccessCost is removed from final dataset - Lot of missing values
  • Remove FrequentFlightDestination - not used for this study
  • GroupTravel - not used for this study
  • TripDuration - not used for this study
  • Gender - no impact of gender on any of the choice (Airport, Airline)
  • AgeGroup - no impact of age on any of the choice (Airport, Airline)
  • SeatClass - Almost all survey participants travelled in economy, so data is biased
  • NoTransport - Not relevant for this study

8 Save the dataset for use in scikit learn

write.xlsx(dat1, 'export_from_code.xlsx')
## Note: zip::zip() is deprecated, please use zip::zipr() instead
# Write CSV in R
write.csv(dat1, file = "export_from_code.csv",row.names=FALSE)

9 Descriptive Statistics of final dataset

Skim summary statistics
n obs: 466
n variables: 8

Variable type: factor
variable missing complete n n_unique top_counts ordered
Airline 0 466 466 4 Kor: 148, For: 133, Asi: 105, Kor: 80 FALSE
Airport 0 466 466 2 GMP: 241, ICN: 225, NA: 0 FALSE
DepartureTime 0 466 466 4 12p: 203, 6pm: 191, 6am: 43, 9pm: 29 FALSE
Destination 0 466 466 4 Sou: 160, Jap: 154, Chi: 131, Oth: 21 FALSE
Nationality 0 466 466 5 Kor: 361, Jap: 39, Chi: 32, Sou: 20 FALSE
Province 0 466 466 8 Se: 181, Ky: 122, Je: 86, Ky: 27 FALSE
Variable type: numeric
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
Age 0 466 466 39.71 13.59 17 29 37 50 80 ▃▇▆▅▃▃▂▁
Airfare 0 466 466 49.24 24.52 3 36.22 46.83 56 260 ▃▇▁▁▁▁▁▁